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Abstract 

In this paper we consider high dimensional sparse regression, and develop strategies able to deal with arbitrary - 
possibly, severe or coordinated - errors in the covariance matrix X. These may come from corrupted data, persistent 
experimental errors, or malicious respondents in surveys/recommender systems, etc. Such non-stochastic error-in- 
variables problems are notoriously difficult to treat, and as we demonstrate, the problem is particularly pronounced 
in high-dimensional settings where the primary goal is support recovery of the sparse regressor. We develop 
algorithms for support recovery in sparse regression, when some number m out of n + n\ total covariate/response 
pairs are arbitrarily (possibly maliciously) corrupted. We are interested in understanding how many outliers, n\, 
we can tolerate, while identifying the correct support. To the best of our knowledge, neither standard outlier 
rejection techniques, nor recently developed robust regression algorithms (that focus only on corrupted response 
variables), nor recent algorithms for dealing with stochastic noise or erasures, can provide guarantees on support 
recovery. Perhaps surprisingly, we also show that the natural brute force algorithm that searches over all subsets 
of n covariate/response pairs, and all subsets of possible support coordinates in order to minimize regression error, 
is remarkably poor, unable to correctly identify the support with even n\ = 0(n/k) corrupted points, where k is 
the sparsity. This is true even in the basic setting we consider, where all authentic measurements and noise are 
independent and sub-Gaussian. In this setting, we provide a simple algorithm - no more computationally taxing 
than OMP - that gives stronger performance guarantees, recovering the support with up to ri\ = 0(n/(Vklogp)) 
corrupted points, where p is the dimension of the signal to be recovered. 

I. Introduction 

Linear regression and sparse linear regression seek to express a response variable as the linear com- 
bination of (a small number of) covariates. They form one of the most basic procedures in statis- 
tics, engineering, and science. More recently, regression has found increasing applications in the high- 
dimensional regime, where the number of variables, p, is much larger than the number of measurements 
or observations, n. Applications in biology, genetics, as well as in social networks, human behavior 
prediction and recommendation, abound, to name just a few. The key structural property exploited in high- 
dimensional regression, is that the regressor is often sparse, or near sparse, and as much recent research 
has demonstrated, in many cases it can be efficiently recovered, despite the grossly underdetermined nature 
of the problem (e.g., [8], [6], [4], [12], [31]). Another common theme in large-scale learning problems - 
particularly problems in the high-dimensional regime - is that we not only have big data, but we have 
dirty data. Recently, attention has focused on the setting where the output (or response) variable and the 
matrix of covariates are plagued by erasures, and/or by stochastic additive noise [23], [26], [27], [9], [10]. 
Yet many applications, including those mentioned, may suffer from persistent errors, that are ill-modeled 
by stochastic distribution; indeed, many applications, particularly those modeling human behavior, may 
exhibit maliciously corrupted data. 

This paper is about extending the power of regression, and in particular, sparse high-dimensional 
regression, to be robust to this type of noise. We call this deterministic or cardinality constrained 
robustness, because rather than restricting the magnitude of the noise, or any other such property of 
the noise, we merely assume there is a bound on how many data points, or how many coordinates of 
every single covariate, are corrupted. Other than this number, we make absolutely no assumptions on 
what the adversary can do - the adversary is virtually unlimited in computational power and knowledge 
about our algorithm and about the authentic points. There are two basic models we consider. In both, we 
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assume there is an underlying generative model: y = X(5* + e, where X is the matrix of covariates, e 
is sub-Gaussian noise. In the row-corruption model, we assume that each pair of covariates and response 
we see is either due to the generative model, i.e., (y^JQ), or is corrupted in some arbitrary way, with the 
only restriction that at most n x such pairs are corrupted. In the distributed corruption model, we assume 
that y and each column of X, has n\ elements that are arbitrarily corrupted (evidently, the second model 
is a strictly harsher corruption model). Building efficient regression algorithms that recover at least the 
support of (3* accurately subject to even such deterministic data corruption, greatly expands the scope of 
problems where regression can be productively applied. The basic question is when is this possible - how 
big can n\ be, while still allowing correct recovery of the support of f3*. 

Many sparse-regression algorithms have been proposed, and their properties under clean observations 
are well understood; we survey some of these results in the next two sections. Also well-known, is that 
the performance of standard algorithms (e.g., Lasso, Orthogonal Matching Pursuit) breaks down even in 
the face of just a few corrupted points or covariate coefficients. As more work has focused on robustness 
in the high-dimensional regime, it has also become clear that the techniques of classical robust statistics 
such as outlier removal preprocessing steps cannot be applied to the high-dimensional regime [13], [18]. 
The reason for this lies in the high dimensionality. In this setting, identifying outliers a priori is typically 
impossible: outliers might not exhibit any strangeness in the ambient space due to the high-dimensional 
noise (see [33] for a further detailed discussion), and thus can be identified only when the true low- 
dimensional structure is (at least approximately) known; on the other hand, the true structure cannot be 
computed by ignoring outliers. Other classical approaches have involved replacing the standard mean 
squared loss with a trimmed variant or even median squared loss [16]; first, these are non convex, and 
second, it is not clear that they provide any performance guarantees, especially in high dimensions. 

Recently, the works in [20], [25], [32], [22] have proposed an approach to handle arbitrary corruption 
in the response variable. As we show, this approach faces serious difficulties when the covariates is also 
corrupted, and is bound to fail in this setting. One might modify this approach in the spirit of Total Least 
Squares (TLS) [36] to account for noise in the covariates (discussed in Section III), but it leads to highly 
non convex problems. Moreover, the approaches proposed in these papers are the natural convexification 
of the (exponential time) brute force algorithm that searches over all subsets of covariate/response pairs 
(i.e., rows of the measurement matrix and corresponding entries of the response vector) and subsets of 
the support (i.e., columns of the measurement matrix) and then returns the vector that minimizes the 
regression error over the best selection of such subsets. Perhaps surprisingly, we show that the brute force 
algorithm itself has remarkably weak performance. Another line of work has developed approaches to 
handle stochastic noise or small bounded noise in the covariates [17], [26], [27], [23], [9]. The corruption 
models in there, however, are different from ours which allows arbitrary and malicious noise; those results 
seem to depend crucially on the assumed structure of the noise and cannot handle the setting in this paper. 

More generally, even beyond regression, in, e.g., robust PCA and robust matrix completion [7], [5], 
[34], [11], [21], recent robust recovery in high dimensions results have for the most part depended on 
convex optimization formulations. We show in Section IV that for our setting, convex-optimization based 
approaches that try to relax the brute-force formulation fail to recover support, with even a constant 
number of outliers. Accordingly, we develop a different line of robust algorithms, focusing on Greedy- 
type approaches like Matching Pursuit (MP). 

In summary, to the best of our knowledge, no robust sparse regression algorithm has been proposed 
that can provide performance guarantees, and in particular, guaranteed support recovery, under arbitrarily 
and maliciously corrupted covariates and response variables. 

We believe robustness is of great interest both in practice and in theory. Modern applications often 
involve "big but dirty data", where outliers are ubiquitous either due to adversarial manipulation or to 
the fact some samples are generated from a model different from the assumed one. It is thus desirable to 
develop robust sparse regression procedures. From a theoretical perspective, it is somewhat surprising that 
the addition of a few outliers can transform a simple problem to a hard one; we discuss the difficulties 
in more detail in the subsequent sections. 
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Paper Contributions: In this paper, we propose and discuss a simple (in particular, efficient) algorithm 
for robust sparse regression, for the setting where both covariates and response variables are arbitrarily 
corrupted, and show that our algorithm guarantees support recovery under far more corrupted covari- 
ate/response pairs than any other algorithm we are aware of. We briefly summarize our contributions 
here: 

1) We consider the corruption model where n\ rows of the covariate matrix X and the response 
vector y are arbitrarily corrupted. We demonstrate that other algorithms we are aware of, including 
standard convex optimization approaches and the natural brute force algorithm, have very weak (if 
any) guarantees for support recovery. 

2) For the corruption model above, we give support recovery guarantees for our algorithm, showing that 
we correctly recover the support with n\ = 0(n/ (Vk logp)) arbitrarily corrupted response/covariate 
pairs. 

3) We consider a stronger corruption model, where instead of n\ corrupted rows of the matrix X, each 
column can have up to n\ arbitrarily corrupted entries. We show that our algorithm also works in 
this setting, with precisely the same recovery guarantees. To the best of our knowledge, this problem 
has not been previously considered. 



II. Problem Setup 

We consider the problem of sparse linear regression. The unknown parameter f3* G W is assumed to 
be A;-sparse (k < p), i.e., has only k nonzeros. The observations take the form of covariate-response pairs 
(xiiVi) eK p xR, i = l,...,n + ni. Among these, n pairs are authentic samples obeying the following 
linear model 

Ui = (xi,0*) + e h 

where is additive noise, and p > n. For corruption, we consider the following two models. 

Definition 1 (Row Corruption). The n\ pairs are arbitrarily corrupted, with both x,- L and being potentially 
corrupted. 

Definition 2 (Distributed Corruption). We allow arbitrary corruption of any n\/2 elements of each column 
of the covariate matrix X and of the response y. 

In particular, the corrupted entries need not lie in the same n\ rows. Clearly this includes the previous 
model as a special case up to a constant factor of 2. 

Note that in both models, we impose no assumption whatsoever on the corrupted pairs. They might 
be unbounded, non-stochastic, and even dependent on the authentic samples. They are unconstrained 
other than in their cardinality - the number of rows or coefficients corrupted. We illustrate both of these 
corruption models pictorially in Figure 1. 

Goal: Given these observations {(x i: yi)}, the goal is to obtain a reliable estimate (5 of (5* with correct 

support and bounded error (5 — (5* .A fundamental question, therefore, is to understand in each given 

2 

model, given p,n, and k, how many outliers (ni) an estimator can handle. 

Remark 3. This is a strong notion of robustness. In particular, requiring support recovery is a more 
stringent requirement than requiring, for example, bounded distance to the true solution, or bounded loss 
degradation. It is also worth noting that robustness here is a completely different notion than robustness 
in the sense of Robust Optimization (e.g., [2]). There, we seek a solution that minimizes the error we 
incur when an adversary perturbs our loss function. In contrast, here, here is a true generative model that 
obeys the structural assumptions of the problem (namely, sparsity of f3*), but an adversary corrupts the 
data that ordinarily provide us with the true input-output behavior of that model. 

We emphasize that our setting is fundamentally different from those that only allow corruptions in y, 
and robust sparse regression techniques that only consider corruption in y are bound to fail in our setting. 
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We elaborate on this point in Section IV and V. Moreover, under the distributed corruption model, there is 
no hope of considering an equivalent model with only corruptions in y: all entries of y could be corrupted 
- an absurd setting to hope for a solution. 




Figure 1 : Consider identifying the dominant genetic markers for a disease; entire experiments might fail, 
or different gene expression levels might be erroneously read across different experiments. Panel (a) 
corresponds to the first type of error, where n x samples JQ) are arbitrarily corrupted. This is our first 
error model. Panel (b) illustrates an instance of the second scenario, and thus our second error model: 
in each experiment, several covariates may be misread, but no single covariate is misread more than n\ 
times. 



III. Related Work 

Under the high-dimensional setting p > n, there is a large body of literature on sparse recovery when 
there is no corruption. It is now well-known that recovery of (3* is possible only when the covariate 
matrix X satisfies certain conditions, such as the Restricted Isometry /Eigenvalue Property [6], [3], Mutual 
Incoherence Condition [14] or Exact Recovery Condition [29]. Various ensembles of random matrices 
are shown to satisfy these conditions with high probability. Many estimators have been proposed, most 
notably Basis Pursuit (a.k.a. Lasso) [28], [14], which solves an ii -regularized least squares problem 

mm \\y - X(5\\ 2 + A \\(3\\ l , 

as well as Orthogonal Matching Pursuit (OMP) [14], [29], which is a greedy algorithm that estimates 
the support of (3* sequentially. Both Lasso and OMP, as well as many other estimators, are guaranteed 
to recover f3* with good accuracy when X is well-conditioned, and the number of observations satisfies 
n > k log p. (Here we mean there exists a constant c, independent of k,n,p, such that the statement holds. 
We use this notation throughout the paper.) Moreover, this condition is also shown to be necessary; see, 
e.g., [30]. 

Most existing methods are not robust to outliers; for example, standard Lasso and OMP fail even if 
only one entry of X or y is corrupted. One might consider a natural modification of Lasso in the spirit 
of Total Least Squares, and solve 

min \\(X - E)(3 - y\\ 2 + A||/3||i + V \\E\U, (1) 

fi,E 

where E accounts for corruption in the covariate matrix, and || • ||* is a norm. When E is known to be 
row sparse (as is the case in our row-corruption model), one might choose || ■ (I* to be || • ||i j2 or || ■ Hi^ 1 ; 
the work in [36] considers using || • ||* = || ■ ||^ (similar to TLS), which is more suitable when E is dense 
yet bounded. The optimization problem (1) is, however, highly non convex due to the bilinear term E(3, 
and no tractable algorithm with provable performance guarantees is known. 

1 1 1 -E7 1| a. ,2 (||£1|i,oo) is the sum of the £ 2 respectively) norms of the rows of E. 
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Another modification of Lasso accounts for the corruption in the response via an additional variable z 
[20], [25], [32], [22]: 

min \\xp-y-z\\ 2 + \\\P\\ 1 + 1 \\z\\ 1 . (2) 

/3,z 

We call this approach Justice Pursuit (JP) after [20]. Unlike the previous approach, the problem (2) is 
convex. In fact, it is the natural convexification of the brute force algorithm: 

min — y — z\\ 2 (3) 

P,z 

s.t. : \\P\\o<k 
\\z\\o < ni, 

where ||u|| denotes the number of nonzero entries in u. It is easy to see (and well known) that the so-called 
Justice Pursuit relaxation (2) is equivalent to minimizing the Huber loss function plus the t x regularizer, 
with an explicit relation between 7 and the parameter of the Huber loss function [15]. Formulation (2) 
has excellent recovery guarantees when only the response variable is corrupted, delivering exact recovery 
under a constant fraction of outliers. However, we show in the next section that a broad class of convex 
optimization-based approaches, with (2) as a special case, fail when the covariate X is also corrupted. 
In the subsequent section, we show that even the original brute force formulation is problematic: while 
it can recover from some number n\ of corrupted rows, that number is order-wise worse than what the 
algorithm we give can guarantee. 

We also note that neither the brute force algorithm above, nor its relaxation, JP, are appropriate for our 
second model for corruption. Indeed, in this setting, modeling via JP would require handling the setting 
where every single entry of the output variable, y, is corrupted, something which certainly cannot be done. 

For standard linear regression problems in the classical scaling n 3> p, various robust estimators have 
been proposed, including M-, R-, and S'-estimators [18], [24], as well as those based on ii -minimization 
[19]. Many of these estimators lead to non-convex optimization problems, and even for those that are 
convex, it is unclear how they can be used in the high-dimensional scaling with sparse /3*. Another 
difficulty in applying classical robust methods to our problems arises from the fact that the covariates, 
Xi, also lie in a high-dimensional space, and thus defeat many outlier detection/rejection techniques that 
might otherwise work well in low-dimensions. Again, for our second model of corruption, outlier detection 
seems even more hopeless. 



IV. Failure of the Convex Optimization Approach 
We consider a broad class of convex optimization-based approaches of the following form: 

mm f(y-X(3) (4) 

s.t. h(/3) < R. 

Here R is a radius parameter that can be tuned. Both /(•) and h(-) are convex functions, which can be 
interpreted as a loss function (of the residual) and a regularizer (of (3), respectively. For example, one 
may take f(v) = min z ||t> — z|| 2 + 7||2||i and h(/3) = \\/3\\i, which recovers the Justice Pursuit (2) by 
Lagrangian duality; note that this f(v) is convex because 

\\avi + (1 - a)v 2 - z\\ 2 + 7IMI1 < "(IK ~ z h + 7lMli) + (! - a )(IK - ^Ih + TlNIi)* 

by sub-additivity of norms. The function /(•) can also be any other robust convex loss function including 
the Huber loss function. 

We assume that /(•) and h(-) obey a very mild condition, which is satisfied by any non-trivial loss 
function and regularizer that we know of. In the sequel we use [zi, z 2 ] to denote the concatenation of two 
column vectors z\ and z 2 . 
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Definition 4 (Standard Convex Optimization (SCO) Condition). We say /(•) and h(-) satisfy the SCO 
Condition if Hindoo f(ctv) = oo for all v ^ 0, f([vi, v 2 \) > /([0; v 2 \) for all v 1: v 2 , and h(-) is invariant 
under permutation of coordinates. 

We also assume R > h{(3*) because otherwise the formulation is not consistent even when there are 
no outliers. The following theorem shows that under this assumption, the convex optimization approach 
fails when both X and y are corrupted. We only show this for our first corruption model, since it is a 
special case of the second distributed model. As illustrated in Figure 1, let A and O be the (unknown) 
sets of indices corresponding to authentic and corrupted observations, respectively, and X A and X° be 
the authentic and corrupted rows of the covariate matrix X — [xi, . . . ,x n+ni ] T . The vectors y A and y° 
are defined similarly. Also let A* be the support of (3*. With this notation, we have the following. 

Theorem 5. Suppose f and h satisfy the SCO Condition. When n\ > 1 and k > 1, the adversary can 
corrupt X and y in such a way that for all R with R > h((3*) the optimal solution does not have the 
correct support. 

Proof: Recall that y = [y A ; y°] and X = [X A ; X°] with y A = X A f3* + e, and A* is the true support. 
The adversary fixes some set A disjoint from the true support A* with |A| = |A*|. It then chooses /3 and 
y° such that (3~ K = (3^ c = 0, and y° = X° p with X° to be determined later. By assumption we 
have h0) = h(/3*) < R, so $ is feasible. Its objective value is f(y - X/3) = f([y A - Xgp%,;Q]) < C 
for some finite constant C. The adversary further chooses X° such that X®* = and Xj( is large. Any 
P supported on A* has objective value 

f(y - X(3) = f([y A - X A (3; X°0 -(3)]) = f([y A - X A (3; X°ft,\) > /([0; X° ft.]), 

which can be made bigger than C under the SCO Condition. Therefore, any solution (3 with the correct 
support A* has a higher objective value than (3, and thus is not the optimal solution. ■ 

Our proof proceeds by using a simple corruption strategy. Certainly, there are natural approaches to deal 
with this specific example, e.g., removing entries of X with large values. But discarding such large-value 
entries is not enough, as there may exist more sophisticated corruption schemes where simple magnitude- 
based clipping is ineffective. We illustrate this with a concrete example in the simulation section, where 
Justice Pursuit along with large-value-trimming fails to recover the correct support. Indeed, this example 
serves merely to illustrate more generally the inadequacy of a purely convex-optimization-based approach. 

More importantly, while the idea of considering an unbounded outlier is not new and has been used 
in classical Robust Statistics and more recently in [35], the above theorem highlights the sharp contrast 
between the success of convex optimization (e.g., JP) under corruption in only y, and its complete failure 
when both X and y axe corrupted. Corruptions in X not only break the linear relationship between y 
and X, but also destroy properties of X necessary for existing sparse regression approaches. In the high 
dimensional setting where support recovery is concerned, there is a fundamental difference between the 
hardness of the two corruption models. 

V. The Natural Brute Force Algorithm 

The brute force algorithm (3) can be restated as follows: it looks at all possible n x k submatrices 
of X and picks the one that gives the smallest regression error w.r.t. the corresponding subvector of y. 
Formally, let Xf denote the submatrix of X corresponding to row indices S and column indices A, and 
let y s denote the subvector of y corresponding to indices S. The algorithm solves 

*A0|| 2 (5) 

n, 
k. 



mm \\y — 

s.t. \S\ = 
IAI = 
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Suppose the optimal solution is S, A, 9. Then, the algorithm outputs (5 with (3a = 9 and (5\c = 0. Note 
that this algorithm has exponential complexity in n and k, and S c can be considered as an operational 
definition of outliers. We show that even this algorithm has poor performance and cannot handle large n\. 

To this end, we consider the simple Gaussian design model, where the entries of X A and e are 
independent zero-mean Gaussian random variables with variance - and — , respectively. The - factor 
is simply for normalization and no generality is lost. We consider the setting where o\ = k and 
(3* k , = [1,...,1] T . If rii = 0, existing methods (e.g., Lasso and standard OMP), and the brute force 
algorithm as well, can recover the support of f3* with high probability provided n > k log p. Here and 
henceforth, by with high probability (w.h.p.) we mean with probability at least 1 —p~ 2 . However, when 
there are outliers, we have the following negative result. 

Theorem 6. Under the above setting, if n > Plogp and n\ > then the adversary can corrupt X 
and y in such a way that the brute force algorithm does not output the correct support A*. 

The proof is given in Section IX. We believe the condition n> k 3 \ogp is an artifact of our proof and 
is not necessary. This theorem shows that the brute force algorithm can only handle O (?) outliers. In 
the next section, we propose a simple, tractable algorithm that outperforms this brute force algorithm and 

can handle O (jfj^j outliers. 

VI. Proposed Approach: Robust Matching Pursuit 

The discussion in the last two sections demonstrates that standard techniques for high-dimensional 
statistics and robust statistics are inadequate to handle our problem. As mentioned in the introduction, 
we believe the key to obtaining an effective robust estimator for high-dimensional data, is simultaneous 
structure identification and outlier rejection. In particular, for the sparse recovery problem where the 
observations Z,- L = (x i: yi) reside in a high-dimensional space, it is crucial to utilize the low-dimensional 
structure of f3* and perform outlier rejection in the "right" low-dimensional space in which f3* lies. In 
this section, we propose a candidate algorithm, called Robust Matching Pursuit (RoMP), which is based 
on this intuition. 

Standard MP estimates the support of f3* sequentially. At each step, it selects the column of X which 
has the largest (in absolute value) inner product with the current residual r, and adds this column to the 
set of previously selected columns. The algorithm iterates until some stopping criterion is met. If the 
sparsity level k of (3* is known, then one may stop MP after k iterations. 

To successfully recover the support of f3*, standard MP relies on the fact that for well-conditioned 
X, the inner product h(j) = {r,Xj) is close to f3j, and thus a large value of h(j) indicates a nonzero 
Pj. When outliers are present, MP fails because the h(j)'s may be distorted significantly by maliciously 
corrupted x/s and y/s. To protect against outliers, it is crucial to obtain a robust estimate of h(j). This 
motivates our robust version of MP. 

The proposed Robust Matching Pursuit algorithm (RoMP) is summarized in Algorithms 1 and 2. Similar 
to standard MP, it selects the columns of X with highest inner products with the residual. There are two 
main differences from standard MP. The key difference is that we compute a robust version of inner 
product by trimming large points. Also, there is no iterative procedure - we take the inner products 
between all the columns of X and the response vector y and selects the top k ones; this leads to a simpler 
analysis. 

The key idea behind RoMP is that it effectively reduces a high-dimensional robust regression problem 
to a much easier low-dimensional (2-D) problem, one that is induced by the sparse structure of (3* . Outlier 
rejection is performed in this low-dimensional space, and the support of j3* is estimated along the way. 
Hence, this procedure fulfils our previous intuition of simultaneous structure identification and outlier 
rejection. 

Our algorithm requires two parameters, n\ and k. We discuss how to choose these parameters after we 
present the performance guarantees in the next section. 
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Algorithm 1 Robust Matching Pursuit (RoMP) 
Input: X,y,k,ni. 

For j — 1, . . . ,p, compute the trimmed inner product (see Algorithm 2): 

h(j) = trimmed-inner-product(y,Xj, ni). 

Sort {|/i(j)|} and select the k largest ones. 
Let A be the set of selected indices. 
Set (5j = h(j) for j e A and otherwise. 
Output: $ 



Algorithm 2 Trimmed Inner Product 

Input: aeR N ,be R N , m 

Compute qi = afci, i — 1, . . . , N. 

Sort {\qi\} and select the smallest (N — n-i) ones. 

Let Vt be the set of selected indices. 

Output: h = 'ZienPi- 



VII. Performance Guarantees for RoMP 

We are interested in finding conditions for (p, k, n, n\) under which RoMP is guaranteed to recover f3* 
with correct support and small error. We consider the following sub-Gaussian design model. Recall that 
a random variable Z is sub-Gaussian with parameter a if E[exp(tZ)] < exp(t 2 cr 2 /2) for all real t. 

Definition 7 (Sub-Gaussian design). Suppose the entries of X A are i.i.d. zero-mean sub-Gaussian variables 
with parameter -4= and variance ^, and the entries of the additive noise are i.i.d. zero-mean sub-Gaussian 

2 

variables with parameter ^f= and with variance ^. 

Note that this general model covers the case of Gaussian, symmetric Bernoulli, and any other distribu- 
tions with bounded support. 



A. Guarantees for the Distributed Corruption Model 

The following theorem characterizes the performance of RoMP, and shows that it can recover the correct 
support even when the number of outliers scales with n. In particular, this shows RoMP can tolerate an 
0(1/Vk) fraction of distributed outliers. Recall that with high probability means with probability at least 
1- /< 2 - 

Theorem 8. Under the Sub-Gaussian design model and the distributed corruption model, the following 
hold with high probability. 

(1) The output of RoMP satisfies the following £ 2 error bound: 

(2) If the nonzero entries of f3* satisfy \/3*\ 2 > (\\/3*\\l/n)\ogp (l + a 2 J \\/3*\\l) , then RoMP correctly 
identifies the nonzero entries of (3* provided 

n > klogp- (l + a 2 e /\\/l*\\l), and 
^ < l/(jk(l + a*/ logp). 
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The proof of the theorem is given in Section IX. A few remarks are in order. 

1) We emphasize that knowledge of the exact number of outliers is not needed - n\ can be any 

upper bound of the number of outliers, because by definition the adversary can change n\ entries 
in each column arbitrarily, and changing less than n\ of them is of course allowed. The theorem 
holds even if there are less than n\ outliers. Of course, this would result in sub-optimal bounds in 
the estimation due to over-conservativeness. In practice, cross-validation could be quite useful here. 

2) We wish to note that essentially all robust statistical procedures we are aware of have the same 
character noted above. This is true even for the simplest algorithms for robustly estimating the mean. 
If an upper bound is known on the fraction of corrupted points, one computes the analogous trimmed 
mean. Otherwise, one can simply compute the median, and the result will have controlled error (but 
will be suboptimal) as long as the number of corrupted points is less than 50% - something which, 
as in our case, and every case, is always impossible to know simply from the data. 

3) In a similar spirit, the requirement of the knowledge of k can also be relaxed. For example, if we 
use some k' > k instead of k, then under the theorem continues to hold in the sense that RoMP 
identifies a superset (with size k') of the support of (3*, and the £ 2 error bound holds with k replaced 
by k'. Standard procedures of estimating the sparsity level (e.g. cross-validation) can also be applied 
in our setting. 

4) Also note that the term a 2 e / \\P*\\ 2 has a natural interpretation of signal to noise ratio. 



B. Guarantee for the Row Corruption Model 

It follows directly from Theorem 8 that RoMP can handle ni/2 corrupted rows under the same condition. 
In fact, a slightly stronger results holds: RoMP can handle n\ corrupted rows. This is the content of the 
follow theorem, with its proof given in Section IX. 

Corollary 9. Under the Sub-Gaussian design model and the row corruption model with at most n\ 
corrupted rows, the conclusions of Theorem 8 holds. 

Therefore, in particular, our algorithm is orderwise stronger than the Brute Force algorithm, in terms 
of the number of outliers it can tolerate while still correctly identifying the support. 



VIII. Experiments 

In this section, we report some simulation results for the performance of RoMP (Algorithm 1) on 
synthetic data. The performance is measured in terms of support recovery (the number of non-zero locations 
of (3* that are correctly identified), and also relative £ 2 - error (||/3 — ^Ib/H^lb)- The authentic data are 
generated under the sub-Gaussian Design model using Gaussian distribution with p = 4000, n = 1600, k = 
10 and cx e = 2, with the non-zero elements of /3* being randomly assigned to ±1. 

For comparison, we also apply standard Lasso and JP [20], [22] to the same data. For these algorithms, 
we search for the values of the tradeoff parameters A, 7 that yield the smallest £ 2 - errors > an d then estimate 
the support using the location of the largest k entries of (5. It is also interesting to ask whether a simple 
modification of JP would perform well. While not analyzed in any of the papers that discuss JP, we 
consider JP with two different pre-processing procedures, both of which aim to detect and correct the 
corrupted entries in X directly. The first one, dubbed JP-fill, finds the set E of the largest — portion of 
the entries of X, and then scales them to have unit magnitude. The second one, dubbed JP-row, discards 
the rii rows of X that contain the most entries in E. 

The corrupted rows (X°,y°) are generated by the following procedure: 
Let 

6* = arg min \\y A - Xf A * )T 9\\ 2 . 
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Set X®» = -^A, where A is a random ±1 matrix of dimension m x k, and = X®*(—/3*). 
For z = 1, . . . , ni, further set 

x° m < = {y?/(Bje*)) ■ Bj, 

where Bi is a (n — A;)-vector with i.i.d. standard Gaussian entries. 
The results are shown in Figures 2. It can be observed that RoMP performs better than Lasso and JP 
for both metrics, especially when the number of outliers is large. The ^-errors of Lasso and JP flatten 
out because it returns a near-zero solution. The pre-processing procedures do not significantly improve 
performance of JP, which highlights the difficulty of performing outlier detection in high dimensions. 
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Figure 2: Panel (a) shows the relative support recovery for different methods, while Panel (b) shows the 
£ 2 recovery errors. In both panels, the error bars correspond to one standard deviation. RoMP signficantly 
outperforms all other methods under both metrics. We note that the £ 2 error of the other methods flattens 
out and error bars shrink to zero, because after about 4% outliers, they return a near-zero solution. 



IX. Proofs 

In this section, we turn to the proofs of Theorem 6, Theorem 8, and Corollary 9, with some technical 
aspects of the proofs deferred to the appendix. 

A. Proof of Theorem 2 

For simplicity we assume A* = {1, . . . , k}, A = {1, . . . , n}, and O = {n+1, . . . , n+ni}. We will show 
that the adversary choose y° and X° in such a way that any "correct" solution of the form (9, S, A*) (i.e., 
with the correct support A*) is not optimal because an alternative solution (6>, S, A) with 9 = [1, . . . , 1] T , 
<S = {ni + 1, . . . , n + ni}, A = {2, . . . , k, k + 1} has smaller objective value. 

Now for the details. The adversary chooses (y°) i = Vk for all i, X® t = 0, and X® +1 = y°, hence 
y° — X'PO = 0. To compute the objective values of the "correct" solution and the alternative solution, we 
need a simple technical lemma, which follows from standard results for the norms of random Gaussian 
matrix. The proof is given in the appendix. 

Lemma 10. Ifn> k 3 \ogp, we have 

\\e + X*6\\l > (l-^))a 2 e ,V6eR k 

: <- h)(i-*)m^ 

with high probability. 



oS/O 



+ Xf /0 - X 



s/o 
fc+l 
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Using the above lemma, we can upper-bound the objective value of the alternative solution: 



y s - xp 



P 



= + 



xfe 



s/o 



+ 



.s/o 



xf /0 § 



' J- 



S/O 



2 
2 

X 



s/o 
fe+l 



< 



e s/o 



1 + 



. v s/o v s/o 
+ A 1 _A fc+l 



A: 



ni 
n 



K 2 + 2). 



(6) 



To lower-bound the objective value of solutions of the form (9,S, A*), we distinguish two cases. If 
S fl O 7^ 0, then the objective value is 



-^A* ^ 1 1 2 — 



„SnO 



v5nO/i|| 2 



,.sno\ 



> k 



(7) 



If S fl (9 = 0, we have 5 = .4 and thus 

||/-*A.0| 



= P 



> 



| e +x^(r-^)|i; 



(8) 



where we use the lemma. When n\ > ^ and of = A:, we have min {A:, (l — |) cr^} > (l + |) (l — ^) (<7g + 
2). Combining (6) (7) and (8) concludes the proof. 



B. Proof of Theorem 3 

We prove Theorem 3 in this section. We need two technical lemmas. The first lemma bounds the 
maximum of independent sub-Gaussian random variables. The proof follows from the definition of sub- 
Gaussianity and Chernoff bound, and is given in the appendix. 

Lemma 11. Suppose Zi, . . . , Z m are m independent sub-Gaussian random variables with parameter a. 
Then we have maxj=i j ... jm \Zi\ < Aay/\ogm + log p. with high probability. 

The second lemma is a standard concentration result for the sum of squares of independent sub-Gaussian 
random variables. It follows directly from Eq. (72) in [23]. 

Lemma 12. Let Yi, . . . ,Y n be n Ltd. zero-mean sub-Gaussian random variables with parameter ^ and 
variance at most -. Then we have 



£i?-n<o 



i=i 



logp 



n 



with high probability for some absolute constant c\. Moreover, if Zi, . . . , Z n are also i.i.d. zero-mean sub- 
Gaussian random variables with parameter ^ and variance at most ^, and independent of Yi, . . . , Y n , 
then 



|^Y^| <c 2 

i=l 

with high probability for some absolute constant c 2 . 



\ogp 



n 
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Remark 13. When the above inequality holds, we write £" =1 Y? « 1 ± and Y^i « ±y^p. 

w.h.p. 

Now consider the trimmed inner product between the jth column of X and y. Let A, is the set of 
index i such that and ?/j are both not corrupted. By assumption \Aj\ > n. By putting \Aj\ — n clean 
indices in Ap we may assume \Aj\ = n without loss of generality. By prescription of Algorithm 2, we 
can write h(j) as 

h U) = E X ^ Vi ~ E Xi i yi + E Xi i yi - 



ieAj 



ie trimmed 
inliers 



ie remaining 
outliers 



We estimate each term in the above sum. 
1) Observe that 

x ijVi = J2 x * ( E« + e ) = E x *pj + E x * ( E x *^ + e ) • 



i€Aj 



ieA-i 



(a) Because the points in Aj obeys the Sub-Gaussian model, Lemma 12 gives ^2 ieA Xfjfi* w 
(3* (l ± y^logp) w.h.p. 

(b) On the other hand, because X ik and JQj are independent when k ^ j, and Zj = ^ikf^k + e 
are i.i.d. sub-Gaussian with parameter and standard deviation at most ^/ (||/>*||2 + of) / n ' we a PPly 
Lemma 12 to obtain J2ieAj Xi i Zi ~ ± 7h\/ (ll/ 5 *^ + a e) lo gP w - n -P- 



2) Again due to independence and sub-Gaussianity of points in Aj, Lemma 11 gives max ieAj \Xij\ 
a/ (log p)/n w.h.p. and max ieA . \yi\ < (log p/n) (\\/3*\\l + erf) w.h.p. It follows that w.h.p. 



< 



E x ^ 



ietrimmed 
inliers 



< n 1 ( max IX^ 



i£A 



)( 



max % 
■eA |y ' 



< 



ni 



3) By prescription of the trimming procedure, either all outliers are trimmed, or the remaining outliers 
are no larger than the trimmed inliers. It follows from the last equation that w.h.p. 



E 

ie remaining 
outliers 



Xijyi 



< 



\ x ijVi\< E \ x > 



vVil ^ ri! 



logp 



ie remaining 
outliers 



ietrimmed 
inliers 



n 



i z 1 9 



Combining pieces, we have for all j — 1, . . . ,p, 



\Kj) -%\<\%\ ^fl^gp + -Woi^ii* + ^ log ^ + ni • nfVfli^"' + <>■ (9) 



If RoMP correctly picks an index j in the true support A*, then the error in estimating j3j is bounded 
by the expression above. If RoMP picks some incorrect index j not in A*, then the difference between 
the corresponding (3j and the true f3*, that should have been picked is still bounded by the expression 
above (up to constant factors). Therefore, we have 



< 



E 

jeA* 



\%\ Jl^P+j^y/(\ml + ^) logp +n 1 l ^(mi + a*) 
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The first part of the theorem then follows after straightforward algebra manipulation. On the other hand, 
RoMP picks the correct support as long as \h(j)\ > \h(j')\ for all j G A*,j' e (A*) c . In view of Eq.(9), 
we require 



One verifies that the above inequalities are satisfied under the conditions in the second part of the theorem. 
C. Proof of Corollary 1 

A careful examination of the proof of Theorem 2 in the last section shows that, when there are n\ 
corrupted rows, the set Aj still has cardinality at least n, and the proof thus holds under the row corruption 
model. 



Adversarial corruption seems to be significantly more difficult than corruption independent from the 
original data, and moreover, corruption in X as well as y appears more challenging than corruption only in 
y. To the best of our knowledge, no prior existing algorithms have provable performance in this setting, or 
in the more difficult yet setting of distributed corruption. This paper provides the first results for both these 
settings. Our results outperform Justice Pursuit, as well as the exponential time Brute Force algorithm; 
more generally we show that no convex optimization based approach improve on the results we provide. 
Generalizing our results to obtain a sequential OMP-like algorithm, and on the other side, understanding 
converse results, are important next steps. 
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Appendix 

Let & = [a e \5 T ] T . We can write 1 1 e H- J^T^ <5 1 1 ^ = \\Zi9'\\ 2 2 with Z x = ±e\Xfi^. Note that Z x is an 

n x (k + 1) matrix with i.i.d. A/"(0, yj entries, whose smallest singular value can be bounded using standard 
results. For example, using Lemma 5.1 in [l]with = Z u N = k + 1, T = {1, . . . , N}, 5 

c (5/2) = 1/288 A; 2 , we have 



?>k 



and 



\ZiO'\\l < 



1 

3k 



Vl|2 



< 1 



with probability at least 



1 _ 2e ^ n - (fc+1)ln(36fc) > 1 - 2p- 3 



provided n > 576(A; + l) 3 ln(36p). This proves the first inequality. 
Let Z = maxj Z { . By definition of sub-Gaussianity, we have 



E 



= E\m&xe tz * /a 

< j2 E [ etZt/a ] 



< me t2 ' 2 
= e 



t 2 /2+logm 
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It follows from Markov Inequality that 

P(Z > at) = P(e^ /a > e* 



< e E 
e 



JZ/o 



< e ~t 2 +t 2 /2+\ogm 



By symmetry we have 

P(mmZi< -at) < e-^ t2+logm , 

i 

so a union bound gives 

P(max \Zi\ > at) < P(max Z { > at) + P(min Z> < -at) 

i i i 

< 2e~5* 2+logm 
Taking t = 4 v /logm + logp yields the result. 



